† Corresponding author. E-mail:
Project supported partially by the National Natural Science Foundation of China (Grant No. 11705097), the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20170895), the Jiangsu Government Scholarship for Overseas Studies of 2018, the Scientific Research Foundation of Nanjing University of Posts and Telecommunications, China (Grant No. NY217013), and the Foundation for Polish Science through TEAM-NET Project (Grant No. POIR.04.04.00-00-17C1/18-00).
As a problem in data science, the inverse Ising (or Potts) problem is to infer the parameters of a Gibbs–Boltzmann distribution of an Ising (or Potts) model from samples drawn from that distribution. The algorithmic and computational interest stems from the fact that this inference task cannot be carried out efficiently by the maximum likelihood criterion, since the normalizing constant of the distribution (the partition function) cannot be calculated exactly and efficiently. The practical interest, on the other hand, flows from several outstanding applications, of which the best known has been predicting spatial contacts in protein structures from tables of homologous protein sequences. Most applications to date have been to data produced by a dynamical process which, as far as is known, cannot be expected to satisfy detailed balance. There is therefore no a priori reason to expect the distribution to be of the Gibbs–Boltzmann type, and no a priori reason to expect that inverse Ising (or Potts) techniques should yield useful information. In this review we discuss two types of problems where progress nevertheless can be made. We find that, depending on model parameters, there are phases where the distribution is in fact close to a Gibbs–Boltzmann distribution, the non-equilibrium nature of the underlying dynamics notwithstanding. We also discuss the relation between inferred Ising model parameters and parameters of the underlying dynamics.
The Gibbs–Boltzmann distribution of the Ising model on L spins (for later reference we prefer to denote the number of spins in the model by the letter L, for “loci”; the more customary letter N will later be reserved for the number of samples drawn from the distribution, following a convention used in statistics) is
From the viewpoint of physics, Eq. (
To stress the inverse nature of the problem it is useful to introduce some notation from statistics. The class of distributions (
A fundamental fact of statistical inference, which holds for all exponential families, is that maximum likelihood does not need all the data. Indeed, in Eq. (
In physics, Eq. (
If we focus on single-spin flips and P(s) in Eq. (
To give an example of a spin-flip dynamics which does not satisfy detailed balance we point to the class of focused algorithms for constraint satisfaction problems, invented by Papadimitriou now three decades ago.[18–26] In such algorithms one imagines that the energy function is a sum of local terms all of which are one or zero. A solution is a configuration where all the energy terms are zero (zero-energy ground state). A focused algorithm is one where the rate of flipping spin i is zero unless at least one of the constraints depending on i is unsatisfied, but otherwise the dynamics remains partly random. It is clear that for such dynamics one can flip into a satisfied state, but once there the dynamics stops; one cannot flip out (the first condition of focusing can be satisfied in the equilibrium algorithm (
Let us now go back to the problem of inferring the parameters of the Ising model in Eq. (
In equilibrium statistical mechanics the answer is clear and simple: the procedure will make sense if the data is generated by a process in detailed balance with an energy function in the same exponential family, and in a phase where sampling is possible. The first condition simply means that if the data is generated from a process with, say, third-order interactions between the spins, those interactions will not be recovered by inferring only first-order and second-order interactions. The second condition means that parameters have to be such that the dynamics explores enough configurations that there is enough information to infer from. A trivial example where this is not the case is zero temperature, where the configuration goes to a local minimum of the energy and then does not change. A more subtle example is a spin glass phase where for large but not infinite β only part of the Gibbs distribution (
Once we step out of the realm of equilibrium dynamics we are much more in the dark. For the specific example of the symmetric simple exclusion process (SSEP) it is known that the stationary distribution, i.e. the equivalent of Eq. (
In this review we will present two cases where the above problem can be analyzed and/or studied in simulations. The first case is kinetic Ising models with possibly different values of pairwise parameters Jij and Jji. When Jij = Jji (symmetric kinetic Ising models) this is nothing but a Monte Carlo procedure to compute the distribution P(s) in Eq. (
The second case is a slightly more involved class of spin dynamics that model evolution under mutations, Darwinian selection (fitness), finite-N effects (genetic drift), and recombination (sex). We will here see that inverse Ising works in certain ranges of the parameters describing the relative strengths of mutations, fitness and sex, but not in others. We will also see that the relation is not trivial; non-trivial theory is needed to translate the results from inverse Ising to inferred fitness that can be compared to model parameters.
This review is organized as follows. In Section
The inverse Ising problem has been studied under several different names, such as statistical inference in exponential families (as above), Boltzmann machines, maximum-entropy modeling, direct coupling analysis (DCA), logistic regression techniques, and so on. For small enough systems (small enough L) maximum likelihood in Eq. (
For large L, maximum likelihood (ML) is not computationally efficient, meaning that it requires an effort growing exponentially in L. In other words, whether ML is computationally feasible for a given fixed L depends on time and the development of computer hardware. Nevertheless, for many applications that have been of interest, either ML has not been feasible, or other inference schemes have given comparable results with less effort. In any case, it has been an interesting theoretical challenge to design and analyze schemes that make a different trade-off between accuracy and computational speed than ML.
The state of the art of inverse Ising was recently extensively reviewed in Ref. [6], and we will here only provide a background for the later sections. A first type of inference methods attempts to circumvent the computational challenge of ML by estimating the partition function Z efficiently. Such methods are collectively known as mean-field inference, because they rely on mean-field techniques. The by far most common version of mean-field inference relies on a variational ansatz in terms of magnetizations, which yields the physical mean-field equations of the Ising model,
A general feature of inference methods of this type is that in the variational ansatz the data is only taken into account through the single-variable marginals, i.e. through the magnetizations. It is only linear-response Eq. (
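As a concrete illustration, here is a minimal numpy sketch of naive mean-field inference from a table of samples; the relations used are the standard nMF ones (couplings from the negative inverse of the connected correlation matrix, fields from inverting the mean-field equation), but the function name and conventions are ours, for illustration only.

```python
import numpy as np

def nmf_inference(samples):
    """Naive mean-field inference from an (N, L) array of +/-1 spins.

    Uses the standard nMF relations: J_ij = -(C^{-1})_ij for i != j,
    and h_i = atanh(m_i) - sum_j J_ij m_j.
    """
    m = samples.mean(axis=0)           # magnetizations m_i
    C = np.cov(samples, rowvar=False)  # connected correlations C_ij
    J = -np.linalg.inv(C)              # nMF couplings (before removing diagonal)
    np.fill_diagonal(J, 0.0)           # only i != j is meaningful
    h = np.arctanh(m) - J @ m          # invert m_i = tanh(h_i + sum_j J_ij m_j)
    return J, h
```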
Another type of mean-field inference equation attempts to find the Ising model which best fits the data. The variational parameters are then magnetizations (mi) and correlations (cij), conjugate to model parameters hi and Jij. This approach was first developed as an iterative procedure called “susceptibility propagation”[42,43] and only later shown to also yield equations like Eqs. (
A different type of inference gives up on the ambition to approximate the partition function, and hence the full probability distribution P(s). Instead, one tries to infer the parameters from some other property which can be efficiently computed. The most widely used such method is maximum pseudo-likelihood[45] or pseudo-likelihood maximization (PLM). This starts from the conditional probability of the Ising model
In the limit of infinite data, PLM will almost surely find the same parameters as ML, a property referred to as statistical consistency (The formal definition of statistical consistency is that as the number of samples goes to infinity, the argmin of the estimator converges in probability to the right answer. This holds for ML and PLM and some other inference methods to be discussed below, but does not hold for mean-field inference methods. In the limit of infinite data the sample averages used in mean-field will almost surely be the same as ensemble averages, but the recovered parameters will not be the true ones because physical mean-field is in itself approximate. For a discussion, see, e.g., Ref. [6] and references cited therein.). In applications PLM has often been found to outperform both naive and advanced mean-field inference.[6] Why that is so cannot be said to be completely known, since the number of samples in real data sets is finite. The error of mean-field inference compared to PLM in the infinite-sample limit (lack of statistical consistency) could therefore be compensated by the error in PLM when used on a finite number of samples. Empirically this has mostly not been found to be the case, but that may partially be a consequence of the kinds of data sets that have been considered in the literature.
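As a sketch of how PLM looks in practice, the following fits the conditional law of a single spin i, P(si|s−i) ∝ exp[si(hi + Σj≠i Jij sj)], with an L2 penalty on the couplings; the regularization strength and the use of scipy’s L-BFGS-B minimizer are illustrative choices, not prescriptions from the works cited above.

```python
import numpy as np
from scipy.optimize import minimize

def plm_one_spin(samples, i, lam=0.01):
    """L2-regularized pseudo-likelihood for spin i from (N, L) +/-1 samples.

    Returns the field h_i and the coupling row J_i. (length L-1).
    """
    s_i = samples[:, i]
    s_rest = np.delete(samples, i, axis=1)   # all other spins

    def neg_log_pl(theta):
        h, J = theta[0], theta[1:]
        H = h + s_rest @ J                   # local field acting on spin i
        # -log P(s_i | rest) = log(1 + exp(-2 s_i H)) for +/-1 spins
        return np.mean(np.logaddexp(0.0, -2.0 * s_i * H)) + lam * np.sum(J**2)

    res = minimize(neg_log_pl, np.zeros(samples.shape[1]), method="L-BFGS-B")
    return res.x[0], res.x[1:]
```

Running this for every i gives an asymmetric estimate of J, which is usually symmetrized by averaging Jij and Jji.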
High-dimensional statistics is the branch of modern statistics where the number of samples (N) is assumed to grow together with or slower than the number of parameters (here
Nevertheless, the rule-of-thumb points to something important, namely that in the important application of inverse Potts methods to contact prediction in protein structures,[46,47] the number of parameters (for 20 types of amino acids in a protein of 100 residues) is typically about 20² ⋅ 100², which is four million, while the number of samples is rarely more than a hundred thousand. All inference methods outlined above are therefore in this application used in regimes where they are under-sampled, and so need to be regularized. For naive mean-field inference a regularization by pseudo-counts (adding fictitious uniformly distributed samples) was used in Refs. [46,47], while an L1-regularization was used in Ref. [48], and an L2-regularization in Ref. [49]. For PLM similarly L2-regularization was used in Refs. [50–52].
An important aspect of all inference is the family from which one tries to infer parameters. This can be given a Bayesian interpretation as an a priori distribution of parameters; the more one knows in that direction, the better the inference can be. Many regularizers can be seen as logarithms of Bayesian prior distributions, so that the analogy also works the other way: regularized inference is equivalent to inference with a prior (the exponential of the regularizer), and can therefore work better because it uses more information. For instance, if all parameters are supposed to be either zero or bounded away from zero by some lower threshold value, and if the ones that are non-zero are sparse, then the authors of Ref. [53] showed that L1-regularized PLM can find the graph structure using relatively few samples, given certain assumptions that were later shown to be restrictive.[54] Nevertheless, using and analyzing thresholding also in the retained predictions, the authors of Ref. [55] were able to show that L1-regularized PLM can indeed find the graph structure using of order log L samples (the same authors also showed that L1-regularized PLM with thresholding, as used in the plmDCA software of Ref. [52], can recover parameters in L2 norm using of order log L samples).
A second and equally important aspect is the evaluation criteria. The criterion in Ref. [53] is graphical: the objective is to infer properties of the model (the non-zero interactions) which can be represented as a graph. It is obvious that inference under this criterion will be difficult without a gap in the distribution of interaction parameters away from zero. Information theory imposes limits on the smallest couplings that can be retrieved from the finite amount of data;[55,56] given finite data it is simply not possible to distinguish a parameter which is strictly zero from one which is only very small. Another type of criterion is metrical, most often the squared differences of the actual and inferred parameter values.[6,27] Yet another is probabilistic by determining some difference between the two probability distributions as in Eq. (
Important results have been obtained as to how many samples are required for successful inference when both the a priori distributions and the criteria are varied, first in Ref. [53] under strong assumptions, and more recently in Refs. [55,57]. These two latter papers also introduced a different objective function, the interaction screening objective (ISO), which depends on the same local quantities as pseudo-likelihood, and which provably outperforms PLM in terms of expected error for a given number of samples, providing a near sample-optimal guarantee. ISO has more recently also been generalized to learn Ising models from samples corrupted by independent noise,[58] and to the case of Potts models and beyond pairwise interactions.[59]
In practice, and in many successful applications to real data, criteria have been of the type “correctly recovering the k largest interactions”, colloquially known as “top-k”. Performance under such criteria is straightforward to analyze empirically when there is a known answer; one simply compiles two lists of the k largest parameters and the interactions they refer to, and then compares the two lists. For instance, one can check what fraction of the k largest inferred interactions can also be found among the k largest actual interactions, which is known as the k-true positive rate, or TPR(k). In the application of inverse Potts methods to contact prediction in protein structures,[46,47] k has commonly been taken to be around 100. The inequality that the number of retained parameters is less than the number of samples has hence been respected, with a large margin. The theoretical analysis of performance under this type of criterion is however more involved, as the distribution of the largest values of a random background is an extreme deviations problem. One approach is to leverage an L∞ norm guarantee;[55,59] for another, using large deviation theory, see Refs. [60,61].
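Evaluating TPR(k) is indeed straightforward; a minimal sketch, assuming couplings are compared by absolute value over the upper triangle (one common convention):

```python
import numpy as np

def tpr_k(J_true, J_inferred, k=100):
    """Fraction of the k largest |inferred| couplings that are also among
    the k largest |true| couplings (upper triangle, i < j)."""
    iu = np.triu_indices_from(J_true, k=1)
    top_true = set(np.argsort(-np.abs(J_true[iu]))[:k])
    top_inf = set(np.argsort(-np.abs(J_inferred[iu]))[:k])
    return len(top_true & top_inf) / k
```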
In Section
Nevertheless, even if the data was generated in a dynamic process, we do not always have time series data. In Sections
Averages at any given time will have errors which go down as
A standard approach to sample the equilibrium Ising model is Glauber dynamics.[63,64] On the level of probability distributions it is formulated as master equations
For small enough systems (small L), Eq. (
For simplicity of presentation we will here focus on the time-homogeneous case in which all parameters are time independent. Distributions will then eventually relax to a stationary state, and we will assume that this process has taken place. Inference can then be carried out by treating samples at different times independently, i.e., by the type of alltime algorithms discussed in Section
The dynamics of a configuration s(t) is governed by the same rates as in Eq. (
One approach to simulation is to introduce a small time step δt and to flip each spin in each time step with probability ωi(s,t)δt. For this scheme to simulate Eq. (
A computationally more efficient scheme is to first consider the rate of the event of flipping any spin. That is,
A third approach is to update at each step a spin i picked uniformly at random with probability γδt. After such an update, which may or may not change the spin value, the new value will be
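A minimal sketch of this third scheme, assuming Glauber rates so that the refreshed value is +1 with probability (1 + tanh(βHi))/2 independently of the old value; attributing time 1/(γL) to each elementary update is the usual convention for random sequential dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)

def glauber_async(J, h, beta=1.0, gamma=1.0, n_steps=100_000):
    """Random sequential (asynchronous) Glauber dynamics.

    At each step one spin is picked uniformly at random and refreshed:
    its new value is +1 with probability (1 + tanh(beta * H_i)) / 2,
    independently of its old value. Each step advances time by
    1 / (gamma * L). Returns the recorded spin history."""
    L = len(h)
    s = rng.choice([-1, 1], size=L)
    history = np.empty((n_steps, L), dtype=int)
    for step in range(n_steps):
        i = rng.integers(L)
        H_i = h[i] + J[i] @ s - J[i, i] * s[i]   # local field, no self-term
        s[i] = 1 if rng.random() < 0.5 * (1.0 + np.tanh(beta * H_i)) else -1
        history[step] = s
    return history
```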
As illustrative examples we will now look at symmetric and asymmetric SK models,[66] which are defined as follows. First we introduce Jij with no restriction on i and j. Such a matrix can be split into its symmetric and asymmetric parts. We write
The interactions Jij define spin update rates (
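For concreteness, one common way of generating such couplings is sketched below: a Gaussian matrix is split into its symmetric and antisymmetric parts, and a parameter k interpolates between the symmetric SK model (k = 0) and the fully asymmetric one (k = 1). The variance convention g²/L is an assumption for illustration; conventions vary.

```python
import numpy as np

rng = np.random.default_rng(1)

def sk_couplings(L, g=1.0, k=1.0):
    """Couplings for a (partially) asymmetric SK model.

    X is Gaussian with variance g^2/L; J = J_sym + k * J_asym, so that
    k = 0 gives a symmetric matrix and k = 1 recovers X itself, with
    J_ij independent of J_ji."""
    X = rng.normal(0.0, g / np.sqrt(L), size=(L, L))
    J_sym = 0.5 * (X + X.T)
    J_asym = 0.5 * (X - X.T)
    J = J_sym + k * J_asym
    np.fill_diagonal(J, 0.0)   # no self-couplings in this example
    return J
```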
Many techniques for inverse Ising as discussed above in Section
We now derive versions of nMF and TAP inference for asynchronously updated kinetic models, following Ref. [81].
For a kinetic Ising model with Glauber dynamics, the state of spin i is time dependent, si(t); thus the time-dependent means and correlations are naturally defined as
We introduce the notation
The first non-trivial equation is obtained by expanding Eq. (
Figure
By a similar procedure we can also derive a higher-order approximation, which we refer to as dynamic TAP. The starting point is to redefine the bi(t) term in the tanh to include a term analogous to the static TAP Eq. (
Equation (
To emphasize how different inference from a time series is compared to inference from samples, we will now show that maximum likelihood inference of such dynamics from such data is possible. We will also show that this approach admits approximation schemes different from mean-field. The presentation will follow Ref. [82].
The log-likelihood of observing a full time series of a set of interacting spins is analogous to the probability of a history of a Poisson point process.[15] The probability space of events in some time period [0, t] consists of the number of jumps (n), the times of these jumps (t1,t2,…,tn), and which spin jumps at each time (i1,i2,…,in). The measure over this space is proportional to the uniform measure over n times a weight
A rigorous construction of the above path probability can be found in Appendix A of Ref. [83]. Here we will follow a more heuristic approach and introduce a small finite time δt such that we can use the first simulation approach discussed in Section
Separating times with and without spin flips (
Similarly to mean-field inference, Eq. (
All four methods now introduced to infer parameters from a time series (nMF, TAP, SHO, and AVE) will produce a fully connected network structure. Similarly to inverse Ising from samples, we may want to include L1 penalties to get the graphical structure.[84] Such effects are considered in Ref. [85], showing that inferring the sparsity structure from time series data is both a feasible and a reliable procedure.
In this section we present performance tests of the four algorithms introduced above for recovering parameters in asynchronous Ising models. We compared the performance of the two ML algorithms, SHO and AVE, to each other and to the two mean-field algorithms, nMF and TAP.
The tested model is, as discussed above, the fully asymmetric SK model (Jij independent of Jji); the Jij are independent, identically distributed Gaussian variables with zero mean and variance g²/N. As a performance measure we use the mean square error (ε), which measures the L2 distance between the inferred parameters and the underlying parameters used to generate the data
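A minimal sketch of this error measure, assuming ε averages the squared deviations over the off-diagonal couplings (normalization conventions vary between papers):

```python
import numpy as np

def mse(J_true, J_inferred):
    """Mean square error over the off-diagonal couplings."""
    off_diag = ~np.eye(J_true.shape[0], dtype=bool)
    return np.mean((J_inferred[off_diag] - J_true[off_diag]) ** 2)
```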
Figure
To summarize, the ML methods recover the model better, but in general more slowly. The mean-field based learning rules (nMF and TAP) are much faster in inferring the couplings, but less accurate than the ML-based iterative learning rules (AVE, SHO).
Inverse Ising problems have been applied to a wide range of data analyses, ranging from equilibrium reconstruction methods to kinetic ones. In this section, based on Refs. [81,86], we will present as illustrations applications to one data set of neuronal spike trains, and to one data set of stock transactions on a financial market. Both areas have been investigated extensively in the last ten years. We refer to Refs. [87–93] for more recent neuronal data and discussions of inference in this context, and to Refs. [94–105] for a sample of contributions considering financial data.
For the neuronal data, we show two ML-based learning rules. When considering the data as a time series we use the AVE method of Eq. (
For the financial stock trade data we show two mean-field-based algorithms. When considering the data as a time series we use the asynchronous nMF method of Eq. (
Neurons are the computational units of the brain. While each neuron is actually a cell with complicated internal structure, there is a long history of considering simplified models where the state of a neuron at a given time is a Boolean variable. Zero (or down, or –1) then means resting, and one (or up, or +1) means firing, or having a spike of activity. In most neural data most neurons are resting most of the time.
Data description and representation of data The neuronal spike trains are from salamander retina under stimulation by a repeated 26.5-second movie clip. This data set records the spiking times of the neurons and has a length of 3180 s (120 repetitions of the movie clip). Here, only the N = 20 neurons with the highest firing rates in the data set are considered. The data had been binned with time windows of 20 ms (the typical time scale of the auto-correlation function of a neuron) in the previous study.[106] However, since we are using the kinetic model, we can study this data set using a much shorter time bin, which leads to low enough firing rates and (almost) never more than one spike per bin. Then the temporal correlations with time delays between neuron pairs, as well as the self-correlations, become important.
For the asynchronous Ising model, the time bins are δt = 1/(γN). For neuronal data, γ can be interpreted as the inverse of the time length of the auto-correlation function, which is typically 10 ms or more.[106] To generate the binary spin history from this spike train data set, the spike trains should be separated into time bins of length γδt = 1/20. This means that the size of the time bins should be chosen as δt = 1/(20γ) = 0.5 ms. The spike trains can then be transformed into binary variables as follows: a +1 is assigned to every time bin in which there is a spike and a –1 when there is none. To avoid the translation ending up with isolated instances of +1 and superfluous –1s, a memory process is introduced for each neuron in the data translation: each spike is extended by a time period drawn from an exponential distribution with mean 1/γ. Denote the total firing number of neuron i as Fi, and
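A hedged sketch of this preprocessing step: each bin of width δt gets +1 if it contains a spike and –1 otherwise, and each spike is extended by an exponentially distributed memory period with mean 1/γ. The function name and the exact way the memory period is applied are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def binarize_spikes(spike_times, t_max, dt=0.5e-3, gamma=100.0):
    """Turn one neuron's spike times (in seconds) into a +/-1 spin train.

    gamma = 100 / s corresponds to an auto-correlation time 1/gamma = 10 ms.
    Each spike sets its bin to +1 and is extended by a memory period drawn
    from an exponential distribution with mean 1/gamma (an assumption
    about the preprocessing)."""
    n_bins = int(np.ceil(t_max / dt))
    s = -np.ones(n_bins, dtype=int)
    for t in spike_times:
        start = int(t / dt)
        stop = start + 1 + int(rng.exponential(1.0 / gamma) / dt)
        s[start:min(stop, n_bins)] = 1
    return s
```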
Inference methods For this fairly small system we use two types of ML to learn the parameters of Eq. (
Inference results In the current inference of retinal functional connections, the values of model parameters such as the window size δt and the inverse time scale γ are set a priori according to previous studies on the equilibrium Ising model. This avoids systematic scans over parameter values.
As presented in Fig.
However, the asynchronous model allows the inference of self-couplings (diagonal elements of the coupling matrix) which are not present in the equilibrium model. As shown in Fig.
This result provides a guide for the use of the equilibrium Ising model: if the asynchronous couplings are far away from the equilibrium ones, this implies that the real dynamical process does not satisfy the Gibbs equilibrium conditions and that the final distribution of states is not the Gibbs equilibrium Ising model. Inferring the equilibrium model is an exponentially hard problem requiring time-consuming Monte Carlo sampling, while the asynchronous approach is not. The asynchronous learning rules thus allow the inference of functional connections that for the retinal data largely agree with the equilibrium model, but the inference is much faster.
In this study, we present equilibrium nMF (
Data description and representation The data consists of transaction records from the New York Stock Exchange (NYSE) over a few years. Each trade is characterized by a time, a traded volume, and a price. We focus only on trades from 100 trading days between 02.01.2003 and 30.05.2003; moreover, only the trading volumes and trading times are used in this study. To avoid the opening and closing periods of the stock exchange, the 10⁴ central seconds of each day are employed, as in Ref. [107]. The data transformation uses a sliding window, which introduces two parameters: the size of the sliding time window (denoted Δt) and the shift, which is fixed at 1 s.
For stock i, the sum of the volumes Vi(t,Δt) traded in window [t, t + Δt) is compared with a given volume threshold
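A minimal sketch of this thresholding step; how the per-stock thresholds are chosen (here a quantile of each stock's volume distribution) is an illustrative assumption, not the exact prescription of the truncated definition above.

```python
import numpy as np

def binarize_volumes(V, q=0.5):
    """Map traded volumes to spins.

    V: (n_stocks, n_windows) array of volumes summed in each sliding
    window [t, t + Delta_t). Each stock gets s = +1 in windows where
    its volume exceeds its own threshold (here a quantile q of its
    volume distribution, an illustrative choice), and s = -1 otherwise."""
    thresholds = np.quantile(V, q, axis=1)[:, None]
    return np.where(V > thresholds, 1, -1)
```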
Figure
Inference methods With the transformed binaries, the magnetization mi and correlations Cij(τ) are defined as Eq. (
Equilibrium nMF (i ≠ j), which only focuses on equal time correlations[108]
Reconstruction results Extensive explorations over different values of the window size Δt and of χ were carried out to obtain meaningful interactions between stocks. A natural rough approach is to consider that couplings contain interesting information if they are large in absolute value: they then indicate a strong interaction between stocks. For asynchronous inference, the derivative of the time-lagged correlations
Figure
The similarity of interaction matrices J and J′ inferred from different methods can be measured by a similarity quantity QJ,J′, which is defined as
Next we present two inferred financial networks, recovered by the equilibrium and the asynchronous nMF methods, respectively. As the inferred financial networks are densely connected, we focus only on the largest couplings, which can be explained by closely related activities of the considered stocks. Figure
The network of Fig.
GE (General Electric) is for a large range of parameters a very central node, which is consistent with its diversified activities. Figure
The banking sector, shown as a chain in light blue, is linked to the sector of electronic technology (in dark blue). Moreover, the defense and aerospace sector, shown in magenta, is linked to engines and machinery with CAT (Caterpillar Inc.) and DE (John Deere), and, more strangely, to packaged food with CAG (ConAgra Foods), SYY (Sysco) and K (Kellogg Company).
Figure
From the network samples we draw the following two basic conclusions. First, they show a market mode (most of the inferred interaction strengths are positive, which indicates that the financial market has a clear collective behavior),[110,111] even though only trade and volume information is considered. Stocks tend to be traded or not traded at the same time.
In addition, the strongest inferred interactions can be easily understood from similarities in the industrial activities of the considered stocks. This means that financial activity tends to concentrate on a certain activity sector at a certain time. For price dynamics this phenomenon is well known,[109,112,114] but it is perhaps more surprising that it appears also based only on information about traded volumes.
We now turn to inverse Ising (Potts) techniques applied to sequence-type biological data. This has variously been called direct coupling analysis (DCA)[36,43,46,47] and max-entropy modeling;[35,115] as noted above, other names are also in use. We will use the terms inverse Ising and DCA interchangeably.
A common feature of all these applications is that the input is a static table of N ⋅ L symbols. Each row is a sequence of L symbols from data, and there are N such rows (N samples). A breakthrough application has been to identify residues (amino acid molecules) that are spatially close in proteins (chains of amino acids). The table then represents a family of proteins with supposedly similar structure and supposedly the same origin, and each row is the amino acid sequence of a member of that family.[36,43,46,47] The basic idea is that two columns in the table (two positions in the protein structure) have non-trivial statistical dependency if their joint variation influences biological fitness. Such co-dependency in biological fitness is called epistasis. The most immediate cause of epistasis among loci inside one gene coding for one protein is through structure.[116] Often this is pictorially motivated by a mutation changing the charge, hydrophilic/hydrophobic character or size of one member of a residue pair, which then changes the relative fitness of variants (alleles) of the other member of the pair. In certain other cases dependencies discovered by DCA can be attributed to causes other than structure,[117,118] but those cases appear to be relatively rare.
Many details are needed to turn the above into a powerful tool in protein structure prediction. One aspect is that proteins in a family typically have different lengths, so the N ⋅ L table is not directly taken from the data, but only after multiple sequence alignment, which has to be carried out with the help of bioinformatics software, or the ready alignment taken from a database such as PFAM.[119,120] Another is that predicted contacts are only one ingredient in a much larger computational pipeline which uses inter-molecular force fields, predictions of secondary structure and solvent accessibility, and know-how developed in the protein science community over many years. Still, impressive results have been achieved.[121–123] It should be noted that if the goal is to predict protein structure, a purely data-driven approach is possible, where a model of the deep neural network type is trained on large training sets comprised of sequence-structure pairs. As has been widely reported, such an approach from Google Deep Mind currently outperforms model-based learning methods such as DCA for this task.[124,125] The price is a computational cost beyond what most academic researchers can afford, and a lack of interpretability of the inferred model, which could be close to Eq. (
Beyond protein structures, DCA has been used to predict nucleotide-nucleotide contacts of RNAs,[126] multiple-scale protein-protein interactions,[118] amino acid–nucleotide interactions in RNA-protein complexes,[127] interactions between HIV and the host immune system,[128–130] and other synergistic effects not necessarily related to spatial contacts.[131–133] Of particular relevance for the following are applications of DCA to whole-genome sequence data from bacterial populations, in Ref. [134] on Streptococcus pneumoniae and in Ref. [135] on Neisseria gonorrhoeae. Standard versions of DCA are rather compute-intensive for genome-scale inference tasks, but methodological speed-ups[136,137] and alternative approaches[138] have quickly been developed. Antibiotic resistance is an important medical problem throughout the world, and so is the relative paucity of new drugs. Combinatorial drug treatments are therefore a promising avenue in the search for new treatment strategies. The obstacle is the combinatorial explosion of combinations: if there are L potential individual targets there will be L² potential target pairs, and so on. The hope is that DCA could be one way (one out of many) to predict which combinations may have an effect, on the grounds that they are already reflected as epistasis in natural sequence data. In that respect it was promising that Skwark et al. in Ref. [134] were able to retrieve interactions between members of the penicillin-binding protein (PBP) family of proteins; resistance to antibiotics in the β-lactam family of compounds is in S. pneumoniae associated with alterations in their target enzymes, which are the PBPs.[139]
Evolution is a dynamic process. We should imagine that the biological sequence data used in DCA are as in Section
We will structure the discussion as follows. In Section
That there exist formal similarities between the dynamics of genomes in a population and entities (spins) in statistical physics has been known for a long time. Fokker–Planck equations describing the change of probability distributions over allele frequencies were introduced by Fisher almost a century ago,[140,141] and later, in a very clear and concise manner, by Kolmogorov.[142] The link has been reviewed several times from the side of statistical physics, for instance in Refs. [143,144]. Central to the discussion in the following will be recombination (or sex), by which two parents give rise to an offspring whose genome is a mixture of the genomes of the parents. From the point of view of statistical physics, recombination is a kind of collision phenomenon. It therefore cannot be described by linear equations (Fokker–Planck-like equations) but can conceivably be described by nonlinear equations (Boltzmann-like equations). The mechanisms to be discussed are of this type, where Boltzmann’s Stosszahlansatz is used to factorize the collision operator.
All mammals reproduce sexually, as do almost all birds, reptiles and fishes, and most insects. Many plants and fungi can reproduce both sexually and asexually. Recombination in bacteria is much less of an all-or-nothing affair. Typically only some genetic material is passed from a donor to a recipient, directly or indirectly. The main forms of bacterial recombination are conjugation (direct transfer of DNA from a donor to a recipient), transformation (the ability to take up DNA from the surroundings), and transduction (transfer of genetic material by the intermediary of viruses). The relative rate of recombination in bacteria varies greatly between species, and also within one species, depending on conditions. As one example we quote a tabulation of the ratio of recombination to mutation rate in S. pneumoniae, which has been measured to vary from less than one to over forty.[145] There are long-standing theoretical arguments against the possibility of complex life without sex, as a consequence of Eigen’s “error catastrophe”.[146,147] It is likely that most forms of life use some form of recombination, albeit perhaps not all the time, though the relative rate of recombination to other processes may be small. In the following we will eventually assume that recombination is faster than other processes, which may be as much the exception as the norm in bacteria and other microorganisms. Such a “dense-gas” regime (to use the collision analogy) is however the one for which a theory is available at the present time.
The driving forces of evolution are hence assumed to be genetic drift, mutations, recombination, and fitness variations. The first refers to the element of chance; in a finite population it is not predetermined which genotypes will reproduce and leave descendants in later generations. The last three describe the expected success or failure of different genotypes.
Genetic drift can be explained by considering N different genomes s1,…,sN. Under neutral evolution all genomes have equal chance to survive into the next generation, but that does not mean all will do so. In a Wright–Fisher model one considers a new generation with N new genomes
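A minimal sketch of one such generation step, assuming selection weights proportional to exp(fitness), which is one common convention; genetic drift alone corresponds to uniform weights.

```python
import numpy as np

rng = np.random.default_rng(3)

def wright_fisher_step(genomes, fitness=None):
    """One Wright-Fisher generation for an (N, L) array of +/-1 genomes.

    N offspring are drawn with replacement from the N parents; the
    probability of being a parent is uniform under neutral evolution,
    or proportional to exp(fitness) otherwise (an illustrative choice)."""
    N = genomes.shape[0]
    if fitness is None:
        p = np.full(N, 1.0 / N)
    else:
        w = np.exp(fitness - np.max(fitness))  # stabilized selection weights
        p = w / w.sum()
    parents = rng.choice(N, size=N, p=p)
    return genomes[parents].copy()
```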
Mutations are random genome changes described by mean rates. A model of N individuals evolving under mutations and genetic drift which happen synchronously is also called a Wright–Fisher model, and when they happen asynchronously, a Moran model.[144] If the genome (or the variability of the genome) consists of only one biallelic locus (one Ising spin, L = 1), then the state of a population is given by the number k of individuals where the allele is “up” (N – k individuals then have the “down” allele). The dynamics of the Moran model can be seen as the dynamics of N spins where each spin can flip on its own or can copy the state of another spin (or do nothing). It can also be seen as transitions on a finite lattice where k = 0 means all spins are down, and k = N means all spins are up. The probability distribution over the variable k, which can take values 0,1,…,N, evolves by a master equation. If the mutation rate is zero, the two end states of the lattice are absorbing: eventually all individuals will be up, or all will be down. If the mutation rate is non-zero but small, the stationary probability distribution will be concentrated on small and large k, and transitions between the two macroscopic states will happen only rarely. For a very pedagogical discussion of these classical facts we refer to Ref. [144].
The evolution of the distribution over L biallelic loci under mutations has many similarities to Eq. (
A fitness landscape means a propensity for a given genotype to propagate its genomic material to the next generation. This propensity is a function of the genotype, called a fitness function. It is important to note that this concept does not cover all that fitness can mean in biology. Excluded effects are for instance cyclical dominance, where A beats B, B beats C, and C beats A.[148–152] We will assume that the fitness of genotype s, which carries allele si on locus i, depends on single-locus variations and pair-wise co-variations, that is,
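In the biallelic, pairwise setting used here this is a quadratic fitness, F(s) = Σi fi si + Σi<j fij si sj; a one-function sketch, assuming ±1 alleles and a symmetric fij matrix with zero diagonal:

```python
import numpy as np

def quadratic_fitness(s, f1, f2):
    """Quadratic ('pairwise epistatic') fitness of a +/-1 genotype s.

    f1: additive effects (length L); f2: symmetric epistatic matrix with
    zero diagonal, so that 0.5 * s @ f2 @ s = sum_{i<j} f_ij s_i s_j."""
    return f1 @ s + 0.5 * s @ f2 @ s
```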
Recombination (or sex) is the mixing of genetic material between different individuals. In diploid organisms (such as humans) every individual has two copies of each separate component of its genetic material (chromosome), of which one comes from the father and one from the mother, each of whom also has two copies, one from each grandparent. When passing from the parents to the child, the material from the grandparents is mixed in the process called cross-over, so that one chromosome of the child inherited from one parent typically consists of segments alternately taken from the two chromosomes of that parent.
In haploid organisms the situation is both simpler since each organism only has one copy of its genetic material, and also more complicated since the mixing of information can happen in many different ways. It is convenient to postulate a dynamics like a physical collision process
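As an illustration of this mixing step, a minimal crossover sketch for two haploid parents, assuming an independent crossover decision with probability ρ before each locus (a simplification; real recombination maps, and the conventions of software such as FFPopSim, are more involved):

```python
import numpy as np

rng = np.random.default_rng(4)

def recombine(parent_a, parent_b, rho=0.5):
    """Offspring genome from two haploid parents via crossover.

    Walk along the loci, switching the parent being copied with
    probability rho at each locus; rho = 0.5 corresponds to free
    recombination (every locus inherited independently)."""
    L = len(parent_a)
    switches = rng.random(L) < rho
    which = (rng.integers(2) + np.cumsum(switches)) % 2  # 0 -> a, 1 -> b
    return np.where(which == 0, parent_a, parent_b)
```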
The concept of quasi-linkage equilibrium (QLE) and its relation to sex were discovered by population geneticist Kimura,[162–164] and later developed by Neher and Shraiman in two influential papers.[158,165] We will refer to this theory of QLE as the Kimura–Neher–Shraiman (KNS) theory. To define QLE and to state the main result of KNS we must first introduce the simpler concept of linkage equilibrium (LE), which goes back to the work of Hardy and Weinberg more than a century ago.[166,167]
Consider two loci A and B where there can be, respectively, nA and nB alleles. The configuration of one genome with respect to A and B is then (xA,xB), where xA takes values in {1,…,nA} and xB takes values in {1,…,nB}. The configuration of a population of N individuals is the set
Specializing for completeness to the case of interest here, where all loci are biallelic and the epistatic contributions to fitness are quadratic (pairwise dependencies), quasi-linkage equilibrium is a subset of distributions in LD where the joint distribution over loci is the Gibbs distribution in Eq. (
The derivation of Eq. (
Turning around the concepts, Eq. (
The parameter
Nevertheless, Eq. (
We here describe results obtained from simulating a finite population using the FFPopSim software.[168] Partial results in the same direction were reported in Ref. [159]; more complete results, though not using the single-time versions of algorithms as we will here, in Ref. [169].
In a finite population, statistical genetics as described above only holds on the average. When following one population in time, fluctuations of order
The inference of fitness depends on a set of parameters governing the evolution of the population. The simulated examples illustrated below use some fixed parameters: the number of loci L = 25, the number of individuals N = 500 (to avoid singular single-generation correlation matrices), the number of generations T = 500 × 5, and the crossover rate ρ = 0.5. The varied parameters are the mutation rate μ, the recombination rate r, and the strength of fitness σ. In the following we discuss what one can observe by systematically varying these three parameters.
Furthermore, it is of interest to see how the KNS inference theory performs when averaging results from singletime data. This means that we infer parameters from snapshots, and then average the inferred parameters over the times of the snapshots. The authors of Ref. [169] in contrast studied the inference using alltime versions of the data, where inference was performed only once. In Fig.
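A minimal sketch of this singletime averaging, reusing any per-snapshot inference routine (for instance the nmf_inference sketch given earlier; which inference method to plug in is a free choice):

```python
import numpy as np

def singletime_average(snapshots, infer):
    """Infer couplings from each generation separately, then average.

    snapshots: iterable of (N, L) genotype arrays, one per sampled
    generation; infer: routine returning (J, h) for one snapshot."""
    Js = [infer(S)[0] for S in snapshots]
    return np.mean(Js, axis=0)
```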
When mutation rates are very low, the frequencies of most loci are frozen to 0 or 1 most of the time. This is a classical fact for the evolution of a single locus, as discussed above, but it also holds more generally; for an evolving population simulated with the FFPopSim software it was demonstrated in Ref. [169]. In this regime fitness recovery is hence impossible, as there is not enough variation. On the other hand, the KNS inference theory does not hold for high enough μ, as one of its assumptions is that recombination is a faster process than mutation. Thus, three points on the μ–r phase diagram are picked with the same μ and differing r’s, marked as sad face, smiling face and not-that-sad face, respectively. The corresponding scatter plots are presented in Fig.
To see if there are differences between inference averaged over singletime data and inference from alltime data, the corresponding scatter plots of Fig.
Inverse Ising/Potts or DCA has emerged as a powerful paradigm of biological data analysis which has helped to revolutionize protein structure prediction. For the first time it has been shown to be possible to predict protein structure from sequence, though crucially from many similar sequences, not from a single one. The central idea which has made this possible is to exploit statistical dependencies encoded by a postulated Gibbs distribution (
In this review we have striven to put these developments in the context of statistical physics. On some level a distribution over sequences must be arrived at by an evolutionary process, which, though it may be complicated, shares aspects of non-equilibrium spin dynamics. Indeed, these analogies have been noted for a long time, and have been explored both from the viewpoint of (theoretical) population genetics and from that of statistical physics. We have here added the dimension of learning: how knowledge of the type of dynamics and inference techniques can be used together to deduce biological parameters from data. We have also considered more direct applications of kinetic Ising models to model the evolution of neurons and of economic data, and how to infer connections from such data.
The main conclusions are as follows. First, we have stressed that dynamics that does not fulfill detailed balance can have practically arbitrarily complicated stationary states, even if interactions are only pair-wise. It therefore cannot be the case that inverse Ising/Potts generally gives useful information: in the wrong parameter phase it is instead much more likely to yield garbage. The simplest example is inference in asynchronous kinetic Ising models discussed in Section
Second, we have considered evolutionary dynamics in finite populations under selection, mutations and recombination. Following the pioneering work of Kimura and more recently Neher and Shraiman, we discussed how the high-recombination regime leads to a distribution of the type (
Many open questions remain in the field of DCA, of which we will discuss only some that are closely connected to the main thrust of our argument. The Kimura–Neher–Shraiman (KNS) theory is a huge step forward towards an understanding of what is actually inferred by such a procedure, but is obviously only a first step. Most directly, both Figs.
Much further work deserves to be done to incorporate more biological realism in KNS and/or its successor theories and software. Among the many important effects (most discussed above) which have not been taken into account in this review we list:
Multi-allele loci
Realistic mutation matrices that vary over a genome and depend on the transitions
Mutations that do not act on single loci, i.e., insertions and deletions (indels)
Other models of fitness and other distributions of, e.g., the pair-wise parameters fij
More realistic models of recombination, incorporating also recombination hotspots
More types of recombination, as appropriate for bacterial evolution
Effects of population growth and bottlenecks
Many kinds of simulation software have been developed in the computational biology community, for instance the fwdpp software,[170] recently put to use in Ref. [171]. The objective would not be to redo or replace such software packages, but to reuse them in the context of theory-driven inference.
One further direction important to pursue is the effect of spatial and environmental separation, believed to be a main mechanism behind speciation and the emergence of biological variation in general. Its effects in models of the Wright–Fisher–Moran type were emphasized in Ref. [144]. Spatial separation would in general tend to counteract recombination, in that individuals which could recombine if they were to meet are actually not likely to meet. For instance, a bacterium with one of the highest known recombination rates is the human pathogen Helicobacter pylori (the cause of stomach ulcers), but two such bacteria can actually only recombine when they find themselves in the stomach of the same host. Strains of H. pylori can thus be distinguished on a global scale, and only merge when their human host populations overlap.[172]
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
[58]
[59]
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
[74]
[75]
[76]
[77]
[78]
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
[91]
[92]
[93]
[94]
[95]
[96]
[97]
[98]
[99]
[100]
[101]
[102]
[103]
[104]
[105]
[106]
[107]
[108]
[109]
[110]
[111]
[112]
[113]
[114]
[115]
[116]
[117]
[118]
[119]
[120]
[121]
[122]
[123]
[124]
[125]
[126]
[127]
[128]
[129]
[130]
[131]
[132]
[133]
[134]
[135]
[136]
[137]
[138]
[139]
[140]
[141]
[142]
[143]
[144]
[145]
[146]
[147]
[148]
[149]
[150]
[151]
[152]
[153]
[154]
[155]
[156]
[157]
[158]
[159]
[160]
[161]
[162]
[163]
[164]
[165]
[166]
[167]
[168]
[169]
[170]
[171]
[172]